Abstract


Component failure in large-scale IT installations is be- 
coming an ever larger problem as the number of compo- 
nents in a single cluster approaches a million. 

In this paper, we present and analyze field-gathered 
disk replacement data from a number of large production 
systems, including high-performance computing sites 
and internet services sites. About 100,000 disks are cov- 
ered by this data, some for an entire lifetime of five years. 
The data include drives with SCSI and FC, as well as 
SATA interfaces. The mean time to failure (MTTF) of 
those drives, as specified in their datasheets, ranges from 
1,000,000 to 1,500,000 hours, suggesting a nominal an- 
nual failure rate of at most 0.88%. 

We find that in the field, annual disk replacement rates 
typically exceed 1%, with 2-4% common and up to 13% 
observed on some systems. This suggests that field re- 
placement is a fairly different process than one might 
predict based on datasheet MTTF. 

We also find evidence, based on records of disk re- 
placements in the field, that failure rate is not constant 
with age, and that, rather than a significant infant mor- 
tality effect, we see a significant early onset of wear-out 
degradation. That is, replacement rates in our data grew 
constantly with age, an effect often assumed not to set in 
until after a nominal lifetime of 5 years. 

Interestingly, we observe little difference in replace- 
ment rates between SCSI, FC and SATA drives, poten- 
tially an indication that disk-independent factors, such as 
operating conditions, affect replacement rates more than 
component specific factors. On the other hand, we see 
only one instance of a customer rejecting an entire pop- 
ulation of disks as a bad batch, in this case because of 
media error rates, and this instance involved SATA disks. 

Time between replacement, a proxy for time between 
failure, is not well modeled by an exponential distribu- 
tion and exhibits significant levels of correlation, includ- 
ing autocorrelation and long-range dependence. 


1 Motivation 


Despite major efforts, both in industry and in academia, 
high reliability remains a major challenge in running 
large-scale IT systems, and disaster prevention and cost 
of actual disasters make up a large fraction of the to- 
tal cost of ownership. With ever larger server clus- 
ters, maintaining high levels of reliability and avail- 
ability is a growing problem for many sites, including 
high-performance computing systems and internet ser- 
vice providers. A particularly big concern is the reliabil- 
ity of storage systems, for several reasons. First, failure 
of storage can not only cause temporary data unavailabil- 
ity, but in the worst case it can lead to permanent data 
loss. Second, technology trends and market forces may 
combine to make storage system failures occur more fre- 
quently in the future [24]. Finally, the size of storage 
systems in modern, large-scale IT installations has grown 
to an unprecedented scale with thousands of storage de- 
vices, making component failures the norm rather than 
the exception [7]. 

Large-scale IT systems, therefore, need better system 
design and management to cope with more frequent fail- 
ures. One might expect increasing levels of redundancy 
designed for specific failure modes [3, 7], for exam- 
ple. Such designs and management systems are based on 
very simple models of component failure and repair pro- 
cesses [22]. Better knowledge about the statistical prop- 
erties of storage failure processes, such as the distribu- 
tion of time between failures, may empower researchers 
and designers to develop new, more reliable and available 
storage systems. 

Unfortunately, many aspects of disk failures in real 
systems are not well understood, probably because the 
owners of such systems are reluctant to release failure 
data or do not gather such data. As a result,
practitioners usually rely on vendor-specified parameters,
such as mean-time-to-failure (MTTF), to model failure
processes, although many are skeptical of the accuracy of
those models [4, 5, 33]. Too much academic and corporate
research is based on anecdotes and back-of-the-envelope
calculations, rather than empirical data [28].

The work in this paper is part of a broader research 
agenda with the long-term goal of providing a better un- 
derstanding of failures in IT systems by collecting, ana- 
lyzing and making publicly available a diverse set of real 
failure histories from large-scale production systems. In 
our pursuit, we have spoken to a number of large pro- 
duction sites and were able to convince several of them 
to provide failure data from some of their systems. 

In this paper, we provide an analysis of seven data sets 
we have collected, with a focus on storage-related fail- 
ures. The data sets come from a number of large-scale 
production systems, including high-performance com- 
puting sites and large internet services sites, and consist 
primarily of hardware replacement logs. The data sets 
vary in duration from one month to five years and cover 
in total a population of more than 100,000 drives from at 
least four different vendors. Disks covered by this data 
include drives with SCSI and FC interfaces, commonly 
represented as the most reliable types of disk drives, as 
well as drives with SATA interfaces, common in desktop 
and nearline systems. Although 100,000 drives is a very 
large sample relative to previously published studies, it 
is small compared to the estimated 35 million enterprise 
drives, and 300 million total drives built in 2006 [1]. Phe- 
nomena such as bad batches caused by fabrication line 
changes may require much larger data sets to fully char- 
acterize. 

We analyze three different aspects of the data. We be- 
gin in Section 3 by asking how disk replacement frequen- 
cies compare to replacement frequencies of other hard- 
ware components. In Section 4, we provide a quantitative 
analysis of disk replacement rates observed in the field 
and compare our observations with common predictors 
and models used by vendors. In Section 5, we analyze 
the statistical properties of disk replacement rates. We 
study correlations between disk replacements and iden- 
tify the key properties of the empirical distribution of 
time between replacements, and compare our results to 
common models and assumptions. Section 6 provides an 
overview of related work and Section 7 concludes. 


2 Methodology 


2.1 What is a disk failure? 


While it is often assumed that disk failures follow a 
simple fail-stop model (where disks either work per- 
fectly or fail absolutely and in an easily detectable man- 
ner [22, 24]), disk failures are much more complex in 
reality. For example, disk drives can experience latent 
sector faults or transient performance problems. Often it
is hard to correctly attribute the root cause of a problem
to a particular hardware component. 

Our work is based on hardware replacement records 
and logs, i.e. we focus on disk conditions that lead a drive 
customer to treat a disk as permanently failed and to re- 
place it. We analyze records from a number of large pro- 
duction systems, which contain a record for every disk 
that was replaced in the system during the time of the 
data collection. To interpret the results of our work cor- 
rectly it is crucial to understand the process of how this 
data was created. After a disk drive is identified as the 
likely culprit in a problem, the operations staff (or the 
computer system itself) perform a series of tests on the 
drive to assess its behavior. If the behavior qualifies as 
faulty according to the customer’s definition, the disk is 
replaced and a corresponding entry is made in the hard- 
ware replacement log. 

The important thing to note is that there is not one 
unique definition for when a drive is faulty. In partic- 
ular, customers and vendors might use different defini- 
tions. For example, a common way for a customer to test 
a drive is to read all of its sectors to see if any reads ex- 
perience problems, and decide that it is faulty if any one 
operation takes longer than a certain threshold. The out- 
come of such a test will depend on how the thresholds 
are chosen. Many sites follow a “better safe than sorry” 
mentality, and use even more rigorous testing. As a re- 
sult, it cannot be ruled out that a customer may declare 
a disk faulty, while its manufacturer sees it as healthy. 
This also means that the definition of “faulty” that a drive 
customer uses does not necessarily fit the definition that 
a drive manufacturer uses to make drive reliability pro- 
jections. In fact, a disk vendor has reported that for 43% 
of all disks returned by customers they find no problem 
with the disk [1]. 

It is also important to note that the failure behavior 
of a drive depends on the operating conditions, and not 
only on component level factors. For example, failure 
rates are affected by environmental factors, such as tem- 
perature and humidity, data center handling procedures, 
workloads and “duty cycles” or powered-on hours pat- 
terns. 

We would also like to point out that the failure behav- 
ior of disk drives, even if they are of the same model, can 
differ, since disks are manufactured using processes and 
parts that may change. These changes, such as a change 
in a drive’s firmware or a hardware component or even 
the assembly line on which a drive was manufactured, 
can change the failure behavior of a drive. This effect 
is often called the effect of batches or vintage. A bad
batch can lead to unusually high drive failure rates or
unusually high rates of media errors. For example, in the
HPC3 data set (Table 1) the customer had 11,000 SATA
drives replaced in Oct. 2006 after observing a high
frequency of media errors during writes. Although it took
a year to resolve, the customer and vendor agreed that
these drives did not meet warranty conditions. The cause
was attributed to the breakdown of a lubricant leading to
unacceptably high head flying heights. In the data, the
replacements of these drives are not recorded as failures.

Table 1: Overview of the seven failure data sets. Note that the disk count given in the table is the number of drives in
the system at the end of the data collection period. For some systems the number of drives changed during the data
collection period, and we account for that in our analysis. The disk parameters 10K and 15K refer to the rotation
speed in revolutions per minute; drives not labeled 10K or 15K probably have a rotation speed of 7200 rpm.

In our analysis we do not further study the effect of 
batches. We report on the field experience, in terms of 
disk replacement rates, of a set of drive customers. Cus- 
tomers usually do not have the information necessary to 
determine which of the drives they are using come from 
the same or different batches. Since our data spans a 
large number of drives (more than 100,000) and comes 
from a diverse set of customers and systems, we as- 
sume it also covers a diverse set of vendors, models and 
batches. We therefore deem it unlikely that our results 
are significantly skewed by “bad batches”. However, we 
caution the reader not to assume all drives behave identi- 
cally. 


2.2 Specifying disk reliability and failure 
frequency 


Drive manufacturers specify the reliability of their prod- 
ucts in terms of two related metrics: the annualized fail- 
ure rate (AFR), which is the percentage of disk drives in 
a population that fail in a test scaled to a per year esti- 
mation; and the mean time to failure (MTTF). The AFR 
of a new product is typically estimated based on accel- 
erated life and stress tests or based on field data from 
earlier products [2]. The MTTF is estimated as the number
of power-on hours per year divided by the AFR. A
common assumption for drives in servers is that they are
powered on 100% of the time. Our data set providers 
all believe that their disks are powered on and in use at 
all times. The MTTFs specified for today’s highest qual- 
ity disks range from 1,000,000 hours to 1,500,000 hours, 
corresponding to AFRs of 0.58% to 0.88%. The AFR 
and MTTF estimates of the manufacturer are included in 
a drive’s datasheet and we refer to them in the remainder 
as the datasheet AFR and the datasheet MTTF. 
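
The relationship between datasheet MTTF and datasheet AFR can be checked with a few lines of Python. This is a minimal sketch, assuming drives are powered on 100% of the time (8,760 hours per year); the function names are illustrative and do not come from any vendor tool.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 power-on hours, assuming 100% uptime


def afr_from_mttf(mttf_hours, hours_per_year=HOURS_PER_YEAR):
    """Annualized failure rate (as a fraction) implied by an MTTF in hours."""
    return hours_per_year / mttf_hours


def mttf_from_afr(afr, hours_per_year=HOURS_PER_YEAR):
    """MTTF in hours implied by an annualized failure rate (as a fraction)."""
    return hours_per_year / afr


for mttf in (1_000_000, 1_500_000):
    print(f"MTTF {mttf:>9,} h -> AFR {100 * afr_from_mttf(mttf):.2f}%")
```

The two datasheet MTTF values quoted above map to roughly 0.88% and 0.58% per year under this assumption.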

In contrast, in our data analysis we will report the 
annual replacement rate (ARR) to reflect the fact that, 
strictly speaking, disk replacements that are reported in 
the customer logs do not necessarily equal disk failures 
(as explained in Section 2.1). 


2.3 Data sources 


Table 1 provides an overview of the seven data sets used 
in this study. Data sets HPC1, HPC2 and HPC3 were 
collected in three large cluster systems at three differ- 
ent organizations using supercomputers. Data set HPC4 
was collected on dozens of independently managed HPC 
sites, including supercomputing sites as well as commer- 
cial HPC sites. Data sets COM1, COM2, and COM3 
were collected in at least three different cluster systems 
at a large internet service provider with many distributed 
and separately managed sites. In all cases, our data re- 
ports on only a portion of the computing systems run 
by each organization, as decided and selected by our 
sources. 

It is important to note that for some systems the number
of drives in the system changed significantly during
the data collection period. While the table provides only
the disk count at the end of the data collection period,
our analysis in the remainder of the paper accounts for
the actual date of these changes in the number of drives.
In addition, some logs also record events other than
replacements, hence the number of disk events given in the
table is not necessarily equal to the number of
replacements or failures. The ARR values for the data sets
therefore cannot be directly computed from Table 1.
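
One way to account for a changing drive population is to divide the number of replacements by the accumulated drive-years of exposure. The sketch below illustrates that idea only; it is an assumption about how such a normalization could be done, not a description of the exact procedure used for each data set, and the population history and replacement count are hypothetical.

```python
from datetime import date


def drive_years(population_changes, end):
    """Accumulated exposure in drive-years.

    population_changes is a list of (date, drive_count) pairs sorted by date;
    each count is assumed to hold until the next change (or until `end`).
    """
    total_drive_days = 0
    following = population_changes[1:] + [(end, 0)]
    for (start, count), (next_start, _) in zip(population_changes, following):
        total_drive_days += count * (next_start - start).days
    return total_drive_days / 365.25


def annual_replacement_rate(n_replacements, population_changes, end):
    """Replacements per drive-year of exposure (as a fraction)."""
    return n_replacements / drive_years(population_changes, end)


# Hypothetical system: 1,000 drives in 2004, growing to 3,000 drives in 2005.
changes = [(date(2004, 1, 1), 1000), (date(2005, 1, 1), 3000)]
end = date(2006, 1, 1)
arr = annual_replacement_rate(120, changes, end)
print(f"exposure: {drive_years(changes, end):.0f} drive-years, ARR: {100 * arr:.1f}%")
```

Dividing by the end-of-period disk count alone would understate the rate for a system that grew during the measurement period.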

Below we describe each data set and the environment 
it comes from in more detail. 

HPC1 is a five-year log of hardware replacements
collected from a 765 node high-performance computing 
cluster. Each of the 765 nodes is a 4-way SMP with 4 GB 
of memory and three to four 18GB 10K rpm SCSI drives. 
Of these nodes, 64 are used as filesystem nodes con- 
taining, in addition to the three to four 18GB drives, 17 
36GB 10K rpm SCSI drives. The applications running 
on this system are typically large-scale scientific simu- 
lations or visualization applications. The data contains, 
for each hardware replacement that was recorded during 
the five year lifetime of this system, when the problem 
started, which node and which hardware component was 
affected, and a brief description of the corrective action. 

HPC2 is a record of disk replacements observed on 
the compute nodes of a 256 node HPC cluster. Each 
node is a 4-way SMP with 16 GB of memory and con- 
tains two 36GB 10K rpm SCSI drives, except for eight 
of the nodes, which contain eight 36GB 10K rpm SCSI 
drives each. The applications running on this system are 
typically large-scale scientific simulations or visualiza- 
tion applications. For each disk replacement, the data set 
records the number of the affected node, the start time of 
the problem, and the slot number of the replaced drive. 

HPC3 is a record of disk replacements observed on 
a 1,532 node HPC cluster. Each node is equipped with 
eight CPUs and 32GB of memory. Each node, except for 
four login nodes, has two 146GB 15K rpm SCSI disks. 
In addition, 11,000 7200 rpm 250GB SATA drives are 
used in an external shared filesystem and 144 73GB 15K 
rpm SCSI drives are used for the filesystem metadata. 
The applications running on this system are typically 
large-scale scientific simulations or visualization appli- 
cations. For each disk replacement, the data set records 
the day of the replacement. 

The HPC4 data set is a warranty service log of disk re- 
placements. It covers three types of SATA drives used in 
dozens of separately managed HPC clusters. For the first 
type of drive, the data spans three years, for the other two 
types it spans a little less than a year. The data records, 
for each of the 13,618 drives, when it was first shipped 
and when (if ever) it was replaced in the field. 

COM1 is a log of hardware failures recorded by an
internet service provider and drawing from multiple
distributed sites. Each record in the data contains a
timestamp of when the failure was repaired, information on
the failure symptoms, and a list of steps that were taken 
to diagnose and repair the problem. The data does not 
contain information on when each failure actually hap- 
pened, only when repair took place. The data covers a 
population of 26,734 10K rpm SCSI disk drives. The to- 
tal number of servers in the monitored sites is not known. 

COM2 is a warranty service log of hardware failures 
recorded on behalf of an internet service provider aggre- 
gating events in multiple distributed sites. Each failure 
record contains a repair code (e.g. “Replace hard drive”) 
and the time when the repair was finished. Again there is 
no information on the start time of each failure. The log 
does not contain entries for failures of disks that were re- 
placed in the customer site by hot-swapping in a spare 
disk, since the data was created by the warranty pro- 
cessing, which does not participate in on-site hot-swap 
replacements. To account for the missing disk replace- 
ments we obtained numbers for the periodic replenish- 
ments of on-site spare disks from the internet service 
provider. The size of the underlying system changed sig- 
nificantly during the measurement period, starting with 
420 servers in 2004 and ending with 9,232 servers in 
2006. We obtained quarterly hardware purchase records 
covering this time period to estimate the size of the disk 
population in our ARR analysis. 

The COM3 data set comes from a large external stor- 
age system used by an internet service provider and com- 
prises four populations of different types of FC disks (see 
Table 1). While this data was gathered in 2005, the sys- 
tem has some legacy components that were as old as from 
1998 and were known to have been physically moved af- 
ter initial installation. We did not include these “obso- 
lete” disk replacements in our analysis. COM3 differs 
from the other data sets in that it provides only aggregate 
statistics of disk failures, rather than individual records 
for each failure. The data contains the counts of disks 
that failed and were replaced in 2005 for each of the four 
disk populations. 


2.4 Statistical methods 


We characterize an empirical distribution using two
important metrics: the mean and the squared coefficient of
variation (C²). The squared coefficient of variation is a
measure of the variability of a distribution and is defined 
as the squared standard deviation divided by the squared 
mean. The advantage of using the squared coefficient of 
variation as a measure of variability, rather than the vari- 
ance or the standard deviation, is that it is normalized by 
the mean, and so allows comparison of variability across 
distributions with different means. 
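
As a small illustration, the squared coefficient of variation can be computed as follows. This sketch uses the population variance, which is an assumption (the sample variance would be an equally reasonable choice), and the sample values are made up.

```python
from statistics import fmean, pvariance


def squared_cv(samples):
    """Squared coefficient of variation: variance divided by squared mean."""
    mean = fmean(samples)
    return pvariance(samples, mu=mean) / (mean * mean)


# Hypothetical times between replacements (arbitrary units).
times = [2, 4, 4, 4, 5, 5, 7, 9]
print(squared_cv(times))
# Rescaling the data (e.g. days to hours) leaves the value unchanged.
print(squared_cv([24 * t for t in times]))
```

For reference, an exponential distribution has a squared coefficient of variation of exactly 1, so values well above 1 indicate more variability than an exponential model would predict.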

We also consider the empirical cumulative distribution
function (CDF) and how well it is fit by four probability
distributions commonly used in reliability theory:
the exponential distribution; the Weibull distribution; the 
gamma distribution; and the lognormal distribution. We 
parameterize the distributions through maximum likeli- 
hood estimation and evaluate the goodness of fit by vi- 
sual inspection, the negative log-likelihood and the chi- 
square tests. 
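
The following sketch shows what maximum likelihood fitting and a comparison by negative log-likelihood can look like for two of the four distributions (exponential and Weibull). It is a simplified illustration using only the Python standard library, not the fitting code used for the results in this paper; the bisection bounds and the sample data are assumptions, and the chi-square test is omitted.

```python
import math


def fit_exponential(x):
    """MLE of the exponential rate (1 / sample mean) and its negative log-likelihood."""
    rate = len(x) / sum(x)
    nll = -sum(math.log(rate) - rate * v for v in x)
    return rate, nll


def fit_weibull(x, lo=0.01, hi=10.0, iters=200):
    """MLE of Weibull shape k and scale lam, found by bisection on the shape.

    Assumes positive data of moderate magnitude and a true shape inside [lo, hi].
    """
    n = len(x)
    mean_log = sum(math.log(v) for v in x) / n

    def score(k):
        # Derivative of the profile log-likelihood in k (divided by n);
        # it is increasing in k, so bisection finds its single root.
        xk = [v ** k for v in x]
        weighted = sum(a * math.log(v) for a, v in zip(xk, x)) / sum(xk)
        return weighted - 1.0 / k - mean_log

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    k = 0.5 * (lo + hi)
    lam = (sum(v ** k for v in x) / n) ** (1.0 / k)
    nll = -sum(
        math.log(k / lam) + (k - 1) * math.log(v / lam) - (v / lam) ** k for v in x
    )
    return k, lam, nll


# Hypothetical times between replacements, in days.
sample_days = [0.5, 1, 1, 2, 3, 3, 5, 8, 13, 30]
rate, nll_exp = fit_exponential(sample_days)
shape, scale, nll_wbl = fit_weibull(sample_days)
print(f"exponential: rate={rate:.3f}, NLL={nll_exp:.2f}")
print(f"Weibull: shape={shape:.3f}, scale={scale:.3f}, NLL={nll_wbl:.2f}")
```

Because the exponential distribution is the special case of the Weibull with shape 1, the fitted Weibull can never have a higher negative log-likelihood than the fitted exponential; the interesting question is how far the fitted shape departs from 1.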

We will also discuss the hazard rate of the distribu- 
tion of time between replacements. In general, the hazard 
rate of a random variable t with probability distribution 
f(t) and cumulative distribution function F(t) is defined
as [25]

$$h(t) = \frac{f(t)}{1 - F(t)}$$

Intuitively, if the random variable t denotes the time
between failures, the hazard rate h(t) describes the
instantaneous failure rate as a function of the time since the most
recently observed failure. An important property of t’s 
distribution is whether its hazard rate is constant (which 
is the case for an exponential distribution) or increasing 
or decreasing. A constant hazard rate implies that the 
probability of failure at a given point in time does not 
depend on how long it has been since the most recent 
failure. An increasing hazard rate means that the proba- 
bility of a failure increases, if the time since the last fail- 
ure has been long. A decreasing hazard rate means that 
the probability of a failure decreases, if the time since the 
last failure has been long. 

The hazard rate is often studied for the distribution of 
lifetimes. It is important to note that we will focus on the 
hazard rate of the time between disk replacements, and 
not the hazard rate of disk lifetime distributions. 
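
To make the three cases concrete, the sketch below evaluates the hazard rate h(t) = f(t) / (1 - F(t)) for an exponential distribution and for Weibull distributions with different shape parameters. The parameter values are arbitrary and chosen only for illustration.

```python
import math


def exponential_hazard(t, rate):
    """Hazard rate of an exponential distribution, computed as pdf / survival."""
    pdf = rate * math.exp(-rate * t)
    survival = math.exp(-rate * t)  # 1 - F(t)
    return pdf / survival


def weibull_hazard(t, shape, scale):
    """Hazard rate of a Weibull distribution, computed as pdf / survival."""
    z = (t / scale) ** shape
    pdf = (shape / scale) * (t / scale) ** (shape - 1) * math.exp(-z)
    survival = math.exp(-z)  # 1 - F(t)
    return pdf / survival


for t in (1.0, 10.0):
    print(
        t,
        exponential_hazard(t, rate=0.5),         # constant in t
        weibull_hazard(t, shape=0.7, scale=2.0),  # decreasing in t
        weibull_hazard(t, shape=1.5, scale=2.0),  # increasing in t
    )
```

A Weibull shape below 1 gives a decreasing hazard rate, a shape above 1 gives an increasing one, and a shape of exactly 1 reduces to the exponential case.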

Since we are interested in correlations between disk 
failures we need a measure for the degree of correlation. 
The autocorrelation function (ACF) measures the corre- 
lation of a random variable with itself at different time 
lags l. The ACF, for example, can be used to determine 
whether the number of failures in one day is correlated 
with the number of failures observed l days later. The
autocorrelation coefficient can range between 1 (high
positive correlation) and -1 (high negative correlation). A
value of zero would indicate no correlation, supporting 
independence of failures per day. 
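
A sample autocorrelation coefficient at lag l can be sketched as follows. This uses one common estimator (lagged covariance divided by the lag-0 sum of squares); other normalizations exist, and the daily counts are hypothetical.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given non-negative lag."""
    n = len(series)
    mean = sum(series) / n
    total = sum((v - mean) ** 2 for v in series)
    lagged = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return lagged / total


# Hypothetical daily failure counts that alternate between 1 and 0.
daily_failures = [1, 0, 1, 0, 1, 0, 1, 0]
for lag in (0, 1, 2):
    print(lag, autocorrelation(daily_failures, lag))
```

On this artificial series the coefficient is strongly negative at lag 1 and strongly positive at lag 2, reflecting the alternating pattern.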

Table 2: Node outages that were attributed to hardware
problems broken down by the responsible hardware
component. This includes all outages, not only those that
required replacement of a hardware component.

Another aspect of the failure process that we will study
is long-range dependence. Long-range dependence
measures the memory of a process, in particular how quickly
the autocorrelation coefficient decays with growing lags.
The strength of the long-range dependence is quantified
by the Hurst exponent. A series exhibits long-range
dependence if its Hurst exponent H satisfies 0.5 < H < 1.
We use the Selfis tool [14] to obtain estimates of the
Hurst parameter using five different methods: the
absolute value method, the variance method, the R/S method,
the periodogram method, and the Whittle estimator. A
brief introduction to long-range dependence and a
description of the Hurst parameter estimators is provided
in [15].
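
To give a feel for how such an estimator works, the sketch below implements a basic aggregated-variance estimate of the Hurst exponent. It is an illustrative simplification and should not be read as the Selfis implementation; the block sizes and the synthetic input series are assumptions.

```python
import math
import random
from statistics import fmean, pvariance


def hurst_aggregated_variance(series, block_sizes=(2, 4, 8, 16, 32)):
    """Estimate H from how the variance of block means decays with block size.

    For block size m the variance of block means scales roughly as m**(2H - 2),
    so H = 1 + slope / 2, where slope is fitted on a log-log scale.
    """
    xs, ys = [], []
    for m in block_sizes:
        n_blocks = len(series) // m
        means = [fmean(series[i * m:(i + 1) * m]) for i in range(n_blocks)]
        xs.append(math.log(m))
        ys.append(math.log(pvariance(means)))
    mx, my = fmean(xs), fmean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return 1.0 + slope / 2.0


# Synthetic independent noise: no long-range dependence is expected here.
rng = random.Random(0)
iid_series = [rng.gauss(0.0, 1.0) for _ in range(4096)]
h_iid = hurst_aggregated_variance(iid_series)
print(f"estimated H for independent noise: {h_iid:.2f}")
```

For an independent series the estimate should land near 0.5; estimates clearly above 0.5 on real failure counts would point toward long-range dependence.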


3 Comparing disk replacement frequency 
with that of other hardware components 


The reliability of a system depends on all its components, 
and not just the hard drive(s). A natural question is
therefore what the relative frequency of drive failures is,
compared to that of other types of hardware failures. To
answer this question we consult data sets HPC1, COM1,
and COM2, since these data sets contain records for all 
types of hardware replacements, not only disk replace- 
ments. Table 3 shows, for each data set, a list of the 
ten most frequently replaced hardware components and 
the fraction of replacements made up by each compo- 
nent. We observe that while the actual fraction of disk 
replacements varies across the data sets (ranging from 
20% to 50%), it makes up a significant fraction in all 
three cases. In the HPC1 and COM2 data sets, disk
drives are the most commonly replaced hardware
component, accounting for 30% and 50% of all hardware
replacements, respectively. In the COM1 data set, disks
are a close runner-up, accounting for nearly 20% of all
hardware replacements.

While Table 3 suggests that disks are among the most 
commonly replaced hardware components, it does not 
necessarily imply that disks are less reliable or have a 
shorter lifespan than other hardware components. The 
number of disks in the systems might simply be much 
larger than that of other hardware components. In order 
to compare the reliability of different hardware compo- 
nents, we need to normalize the number of component 
replacements by the component’s population size. 

Table 3: Relative frequency of hardware component replacements for the ten most frequently replaced components in
systems HPC1, COM1 and COM2, respectively. Abbreviations are taken directly from service data and are not known
to have identical definitions across data sets.

Unfortunately, we do not have, for any of the systems,
exact population counts of all hardware components.
However, we do have enough information in HPC1 to
estimate counts of the four most frequently replaced
hardware components (CPU, memory, disks, motherboards).
We estimate that there is a total of 3,060 CPUs, 3,060
memory DIMMs, and 765 motherboards, compared to a
disk population of 3,406. Combining these numbers with
the data in Table 3, we conclude that, over five years of
use in the HPC1 system, a memory DIMM was replaced at a
rate roughly comparable to that of a hard drive; a CPU
was replaced about 2.5 times less often than a hard
drive; and a motherboard was replaced 50% less often
than a hard drive.
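
The normalization step can be sketched as below. The population estimates are the HPC1 figures quoted in the text; the replacement counts are deliberately artificial placeholders (the same round number for every component) and are not taken from Table 3.

```python
# Estimated HPC1 component populations, as quoted in the text.
populations = {
    "CPU": 3060,
    "memory DIMM": 3060,
    "motherboard": 765,
    "hard drive": 3406,
}

# Placeholder replacement counts, NOT real data: identical on purpose, to show
# that equal raw counts can imply very different per-component rates.
placeholder_replacements = {name: 100 for name in populations}


def replacements_per_unit(replacements, populations):
    """Replacements divided by population size, per component type."""
    return {name: replacements[name] / populations[name] for name in populations}


rates = replacements_per_unit(placeholder_replacements, populations)
for name, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:12s} {rate:.3f} replacements per installed unit")
```

With identical raw counts, the component with the smallest population ends up with the highest per-unit rate, which is why raw replacement shares alone cannot rank component reliability.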


The above discussion covers only failures that re- 
quired a hardware component to be replaced. When run- 
ning a large system one is often interested in any hard- 
ware failure that causes a node outage, not only those 
that necessitate a hardware replacement. We therefore 
obtained the HPC1 troubleshooting records for any node 
outage that was attributed to a hardware problem, in- 
cluding problems that required hardware replacements 
as well as problems that were fixed in some other way. 
Table 2 gives a breakdown of all records in the trou- 
bleshooting data, broken down by the hardware com- 
ponent that was identified as the root cause. We ob- 
serve that 16% of all outage records pertain to disk drives 
(compared to 30% in Table 3), making it the third most 
common root cause reported in the data. The two most 
commonly reported outage root causes are CPU and 
memory, with 44% and 29%, respectively. 


For a complete picture, we also need to take the sever- 
ity of an anomalous event into account. A closer look 
at the HPC1 troubleshooting data reveals that a large 
number of the problems attributed to CPU and memory 
failures were triggered by parity errors, i.e. the number 
of errors is too large for the embedded error correcting 
code to correct them. In those cases, a simple reboot 
will bring the affected node back up. On the other hand, 
the majority of the problems that were attributed to hard
disks (around 90%) lead to a drive replacement, which is
a more expensive and time-consuming repair action. 

Ideally, we would like to compare the frequency of 
hardware problems that we report above with the
frequency of other types of problems, such as software
failures, network problems, etc. Unfortunately, we do not
have this type of information for the systems in Table 1. 
However, in recent work [27] we have analyzed failure 
data covering any type of node outage, including those 
caused by hardware, software, network problems, en- 
vironmental problems, or operator mistakes. The data 
was collected over a period of 9 years on more than 20 
HPC clusters and contains detailed root cause informa- 
tion. We found that, for most HPC systems in this data, 
more than 50% of all outages are attributed to hardware 
problems and around 20% of all outages are attributed to 
software problems. Consistent with the data in Table 2,
the two most common hardware components to cause a 
node outage are memory and CPU. The data of this re- 
cent study [27] is not used in this paper because it does 
not contain information about storage replacements. 


4 Disk replacement rates 


4.1 Disk replacements and MTTF 


In the following, we study how field experience with
disk replacements compares to datasheet specifications
of disk reliability. Figure 1 shows the datasheet AFRs
(horizontal solid and dashed lines), the observed ARRs
for each of the seven data sets and the weighted average
ARR for all disks less than five years old (dotted line).
For HPC1, HPC3, HPC4 and COM3, which cover
different types of disks, the graph contains several bars, one
for each type of disk, in the left-to-right order of the
corresponding top-to-bottom entries in Table 1. Since at this
point we are not interested in wearout effects after the
end of a disk’s nominal lifetime, we have included in
Figure 1 only data for drives within their nominal lifetime of
five years. In particular, we do not include a bar for the
fourth type of drives in COM3 (see Table 1), which were
deployed in 1998 and were more than seven years old at
the end of the data collection. These possibly “obsolete”
disks experienced an ARR, during the measurement
period, of 24%. Since these drives are well outside the
vendor’s nominal lifetime for disks, it is not surprising that
the disks might be wearing out. All other drives were
within their nominal lifetime and are included in the
figure.

Figure 1: Comparison of datasheet AFRs (solid and dashed line in the graph) and ARRs observed in the field. Each
bar in the graph corresponds to one row in Table 1. The dotted line represents the weighted average over all data sets.
Only disks within the nominal lifetime of five years are included, i.e. there is no bar for the COM3 drives that were
deployed in 1998. The third bar for COM3 in the graph is cut off; its ARR is 13.5%.

Figure 1 shows a significant discrepancy between 
the observed ARR and the datasheet AFR for all data 
sets. While the datasheet AFRs are between 0.58% and 
0.88%, the observed ARRs range from 0.5% to as high 
as 13.5%. That is, the observed ARRs, by data set and
type, are up to a factor of 15 higher than datasheet
AFRs.

Most commonly, the observed ARR values are in the 
3% range. For example, the data for HPC1, which covers 
almost exactly the entire nominal lifetime of five years 
exhibits an ARR of 3.4% (significantly higher than the 
datasheet AFR of 0.88%). The average ARR over all data 
sets (weighted by the number of drives in each data set) 
is 3.01%. Even after removing all COM3 data, which 
exhibits the highest ARRs, the average ARR was still 
2.86%, 3.3 times higher than 0.88%. 
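
The ratios quoted above, and the idea of a drive-count-weighted average, can be reproduced with a short sketch. The ARR percentages are the ones stated in the text; the per-data-set inputs passed to the weighted average are hypothetical, since the per-row values from Table 1 and Figure 1 are not reproduced here.

```python
DATASHEET_AFR_PERCENT = 0.88  # corresponds to a datasheet MTTF of 1,000,000 hours


def ratio_to_datasheet(arr_percent, afr_percent=DATASHEET_AFR_PERCENT):
    """How many times larger an observed ARR is than the datasheet AFR."""
    return arr_percent / afr_percent


def weighted_average_arr(arrs_and_counts):
    """Average ARR weighted by the number of drives in each data set."""
    total_drives = sum(count for _, count in arrs_and_counts)
    return sum(arr * count for arr, count in arrs_and_counts) / total_drives


# ARR percentages stated in the text.
for label, arr in [
    ("weighted average, all data sets", 3.01),
    ("weighted average, without COM3", 2.86),
    ("highest observed (COM3)", 13.5),
]:
    print(f"{label}: {ratio_to_datasheet(arr):.2f}x the datasheet AFR")

# Hypothetical (ARR %, drive count) pairs, only to show the weighting.
print(weighted_average_arr([(2.0, 100), (4.0, 300)]))
```

Weighting by drive count keeps a small population with an extreme ARR from dominating the overall average.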

It is interesting to observe that for these data sets there 
is no significant discrepancy between replacement rates 
for SCSI and FC drives, commonly represented as the 
most reliable types of disk drives, and SATA drives,
frequently described as lower quality. For example, the
ARRs of drives in the HPC4 data set, which are
exclusively SATA drives, are among the lowest of all data
sets. Moreover, the HPC3 data set includes both SCSI 
and SATA drives (as part of the same system in the same 
operating environment) and they have nearly identical re- 
placement rates. Of course, these HPC3 SATA drives 
were decommissioned because of media error rates at- 
tributed to lubricant breakdown (recall Section 2.1), our 
only evidence of a bad batch, so perhaps more data is 
needed to better understand the impact of batches in 
overall quality. 

It is also interesting to observe that the only drives that 
have an observed ARR below the datasheet AFR are the 
second and third type of drives in data set HPC4. One 
possible reason might be that these are relatively new 
drives, all less than one year old (recall Table 1). Also, 
these ARRs are based on only 16 replacements, perhaps 
too little data to draw a definitive conclusion. 

A natural question arises: why are the observed disk
replacement rates so much higher in the field data than
the datasheet MTTF would suggest, even for drives in
the first years of operation? As discussed in Sections 2.1
and 2.2, there are multiple possible reasons. 

First, customers and vendors might not always agree 
on the definition of when a drive is “faulty”. The fact 
that a disk was replaced implies that it failed some (pos- 
sibly customer specific) health test. When a health test 
is conservative, it might lead to replacing a drive that the 
vendor tests would find to be healthy. Note, however, 
that even if we scale down the ARRs in Figure 1 to 57% 
of their actual values, to estimate the fraction of drives 
returned to the manufacturer that fail the latter’s health 
test [1], the resulting AFR estimates are still more than a 
factor of two higher than datasheet AFRs in most cases. 


Second, datasheet MTTFs are typically determined 
based on accelerated (stress) tests, which make certain 
assumptions about the operating conditions under which 
the disks will be used (e.g. that the temperature will 
always stay below some threshold), the workloads and 
“duty cycles” or powered-on hours patterns, and that cer- 
tain data center handling procedures are followed. In 
practice, operating conditions might not always be as 
ideal as assumed in the tests used to determine datasheet 
MTTFs. A more detailed discussion of factors that can 
contribute to a gap between expected and measured drive 
reliability is given by Elerath and Shah [6]. 

Below we summarize the key observations of this 
section. 


Observation 1: Variance between datasheet MTTF and 
disk replacement rates in the field was larger than we 
expected. The weighted average ARR was 3.4 times 
larger than 0.88%, corresponding to a datasheet MTTF 
of 1,000,000 hours. 


Observation 2: For older systems (5-8 years of age), 
data sheet MTTFs underestimated replacement rates by 
as much as a factor of 30. 


Observation 3: Even during the first few years of a 
system’s lifetime (< 3 years), when wear-out is not ex- 
pected to be a significant factor, the difference between 
datasheet MTTF and observed time to disk replacement 
was as large as a factor of 6. 


Observation 4: In our data sets, the replacement rates 
of SATA disks are not worse than the replacement rates 
of SCSI or FC disks. This may indicate that disk- 
independent factors, such as operating conditions, usage 
and environmental factors, affect replacement rates more 
than component specific factors. However, the only ev- 
idence we have of a bad batch of disks was found in a 
collection of SATA disks experiencing high media error 
rates. We have too little data on bad batches to estimate 
the relative frequency of bad batches by type of disk, 
although there is plenty of anecdotal evidence that bad 
batches are not unique to SATA disks. 


4.2 Age-dependent replacement rates 


One aspect of disk failures that single-value metrics such 
as MTTF and AFR cannot capture is that in real life fail- 
ure rates are not constant [5]. Failure rates of hardware 
products typically follow a “bathtub curve” with high 
failure rates at the beginning (infant mortality) and the 
end (wear-out) of the lifecycle. Figure 2 shows the
failure rate pattern that is expected for the life cycle of
hard drives [4, 5, 33]. According to this model, the first
year of operation is characterized by early failures (or
infant mortality). In years 2-5, the failure rates are
approximately in steady state, and then, after years 5-7,
wear-out starts to kick in.

Figure 2: Lifecycle failure pattern for hard drives [33].

The common concern that MTTFs do not capture
infant mortality has led the International Disk Drive
Equipment and Materials Association (IDEMA) to
propose a new standard for specifying disk drive reliability,
based on the failure model depicted in Figure 2 [5, 33].
The new standard requests that vendors provide four
different MTTF estimates, one for the first 1-3 months of
operation, one for months 4-6, one for months 7-12, and
one for months 13-60.

The goal of this section is to study, based on our field
replacement data, how disk replacement rates in
large-scale installations vary over a system’s life cycle. Note
that we only see customer-visible replacements. Any
infant mortality failures caught in manufacturing, system
integration or installation testing are probably not
recorded in production replacement logs.

The best data sets to study replacement rates across the
system life cycle are HPC1 and the first type of drives
of HPC4. The reason is that these data sets span a long
enough time period (5 and 3 years, respectively) and each
covers a reasonably homogeneous hard drive population,
allowing us to focus on the effect of age.

We study the change in replacement rates as a function
of age at two different time granularities, on a per-month
and a per-year basis, to make it easier to detect both
short-term and long-term trends. Figure 3 shows the annual
replacement rates for the disks in the compute nodes of
system HPC1 (left), the file system nodes of system HPC1
(middle) and the first type of HPC4 drives (right), at a
yearly granularity.
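
The per-year and per-month views can be sketched as follows. This is a minimal illustration that assumes a constant drive population (every replaced drive is substituted one for one) and a hypothetical input format of drive ages in months at the time of replacement; it is not the exact procedure used to produce Figures 3 and 4.

```python
from collections import Counter


def arr_by_year(replacement_ages_months, population, years):
    """Annual replacement rate for each year of age.

    replacement_ages_months: age in whole months (0-based) at each replacement.
    population: number of drive slots in service (assumed constant).
    """
    per_year = Counter(age // 12 for age in replacement_ages_months)
    return [per_year.get(y, 0) / population for y in range(years)]


def arr_by_month(replacement_ages_months, population, months):
    """Annualized replacement rate for each month of age (monthly rate * 12)."""
    per_month = Counter(replacement_ages_months)
    return [12 * per_month.get(m, 0) / population for m in range(months)]


if __name__ == "__main__":
    # Hypothetical replacement log: ages (in months) at which drives were replaced.
    ages = [0, 5, 13, 14, 30]
    print(arr_by_year(ages, population=100, years=3))
    print(arr_by_month(ages, population=100, months=6))
```

Multiplying the monthly rate by 12 keeps both views on the same annualized scale, so a month with an unusually high count is directly comparable to the yearly ARR.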

We make two interesting observations. First, replacement
rates in all years, except for year 1, are larger than
the datasheet MTTF would suggest. For example, in
HPC1’s second year, replacement rates are 20% larger
than expected for the file system nodes, and a factor of
two larger than expected for the compute nodes. In year
4 and year 5 (which are still within the nominal lifetime
of these disks), the actual replacement rates are 7-10
times higher than the failure rates we expected based on
datasheet MTTF.

Figure 3: ARR for the first five years of system HPC1’s lifetime, for the compute nodes (left) and the file system nodes
(middle). ARR for the first type of drives in HPC4 as a function of drive age in years (right).

Figure 4: ARR per month over the first five years of system HPC1’s lifetime, for the compute nodes (left) and the file
system nodes (middle). ARR for the first type of drives in HPC4 as a function of drive age in months (right).

The second observation is that replacement rates are
rising significantly over the years, even during early
years in the lifecycle. Replacement rates in HPC1 nearly
double from year 1 to 2, or from year 2 to 3. This
observation suggests that wear-out may start much earlier
than expected, leading to steadily increasing replacement
rates during most of a system’s useful life. This is an
interesting observation because it does not agree with the
common assumption that after the first year of operation,
failure rates reach a steady state for a few years, forming
the “bottom of the bathtub”.

Next, we move to the per-month view of replacement
rates, shown in Figure 4. We observe that for the HPC1
file system nodes there are no replacements during the
first 12 months of operation, i.e. there is no detectable
infant mortality. For HPC4, the ARR of drives is not
higher in the first few months of the first year than in the
last few months of the first year. In the case of the
HPC1 compute nodes, infant mortality is limited to the
first month of operation and is not above the steady state
estimate of the datasheet MTTF. Looking at the lifecycle
after month 12, we again see continuously rising
replacement rates, instead of the expected “bottom of the
bathtub”.


Below we summarize the key observations of this 
section. 


Observation 5: Contrary to common and proposed 
models, hard drive replacement rates do not enter steady 
state after the first year of operation. Instead replacement 
rates seem to steadily increase over time. 


Observation 6: Early onset of wear-out seems to have 
a much stronger impact on lifecycle replacement rates 
than infant mortality, as experienced by end customers, 
even when considering only the first three or five years 
of a system’s lifetime. We therefore recommend that
wear-out be incorporated into new standards for disk
drive reliability. The new standard suggested by IDEMA
does not take wear-out into account [5, 33].

Figure 5: CDF of number of disk replacements per month in HPC1


5 Statistical properties of disk failures 


In the previous sections, we have focused on aggregate 
statistics, e.g. the average number of disk replacements 
in a time period. Often one wants more information on 
the statistical properties of the time between failures than 
just the mean. For example, determining the expected
time to failure for a RAID system requires an estimate of
the probability of experiencing a second disk failure in a
short period, that is, while reconstructing lost data from
redundant data. This probability depends on the
underlying probability distribution and may be poorly estimated
by scaling an annual failure rate down to a few hours.

The most common assumption about the statistical 
characteristics of disk failures is that they form a Pois- 
son process, which implies two key properties: 


1. Failures are independent. 


2. The time between failures follows an exponential 
distribution. 


The goal of this section is to evaluate how realistic the 
above assumptions are. We begin by providing statistical 
evidence that disk failures in the real world are unlikely 
to follow a Poisson process. We then examine each of the 
two key properties (independent failures and exponential 
time between failures) independently and characterize in 
detail how and where the Poisson assumption breaks. In 
our study, we focus on the HPC1 data set, since this is the 
only data set that contains precise timestamps for when 
a problem was detected (rather than just timestamps for 
when repair took place). 


5.1 The Poisson assumption 


The Poisson assumption implies that the number of
failures during a given time interval (e.g. a week or a month)
is distributed according to the Poisson distribution.
Figure 5 (left) shows the empirical CDF of the number of
disk replacements observed per month in the HPC1 data
set, together with the Poisson distribution fit to the data’s
observed mean.


We find that the Poisson distribution does not provide
a good visual fit for the number of disk replacements per
month in the data, in particular for very small and very
large numbers of replacements in a month. For example,
under the Poisson distribution the probability of seeing
≥ 20 failures in a given month is less than 0.0024, yet
we see 20 or more disk replacements in nearly 20% of
all months in HPC1’s lifetime. Similarly, the probability
of seeing zero or one failure in a given month is only
0.0003 under the Poisson distribution, yet in 20% of all
months in HPC1’s lifetime we observe zero or one disk
replacement.
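
The comparison above can be reproduced in outline as follows: fit a Poisson distribution to the observed mean, then compare its tail probabilities with the empirical fraction of months in each tail. This is a sketch only; the monthly counts in the example are hypothetical, and the thresholds `low` and `high` are parameters we introduce for illustration.

```python
import math


def poisson_pmf(k, mean):
    """P(X = k) for X ~ Poisson(mean), computed in log space for stability."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean)."""
    return sum(poisson_pmf(i, mean) for i in range(k + 1))


def compare_tails(monthly_counts, low=1, high=20):
    """Poisson vs. empirical probability of <= low and >= high events per month."""
    n = len(monthly_counts)
    mean = sum(monthly_counts) / n  # Poisson fit uses the observed mean
    return {
        "mean": mean,
        "poisson_p_low": poisson_cdf(low, mean),
        "empirical_p_low": sum(c <= low for c in monthly_counts) / n,
        "poisson_p_high": 1.0 - poisson_cdf(high - 1, mean),
        "empirical_p_high": sum(c >= high for c in monthly_counts) / n,
    }


if __name__ == "__main__":
    # Hypothetical monthly replacement counts, for illustration only.
    counts = [0, 1, 10, 25, 30]
    for name, value in compare_tails(counts).items():
        print(f"{name}: {value:.6f}")
```

A large gap between the Poisson and empirical values in both tails is the pattern described in the text: the data has many more very quiet and very busy months than a Poisson process with the same mean would produce.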


A chi-square test reveals that we can reject the hypoth- 
esis that the number of disk replacements per month fol- 
lows a Poisson distribution at the 0.05 significance level. 
All above results are similar when looking at the distribu- 
tion of number of disk replacements per day or per week, 
rather than per month. 


One reason for the poor fit of the Poisson distribution 
might be that failure rates are not steady over the life- 
time of HPC1. We therefore repeat the same process for 
only part of HPC1’s lifetime. Figure 5 (right) shows the 
distribution of disk replacements per month, using only 
data from years 2 and 3 of HPC1. The Poisson distri- 
bution achieves a better fit for this time period and the 
chi-square test cannot reject the Poisson hypothesis at a 
significance level of 0.05. Note, however, that this does 
not necessarily mean that the failure process during years 
2 and 3 does follow a Poisson process, since this would 
also require the two key properties of a Poisson process 
(independent failures and exponential time between
failures) to hold. We study these two properties in detail in
the next two sections.

Figure 6: Autocorrelation function for the number of disk replacements per week computed across the entire lifetime
of the HPC1 system (left) and computed across only one year of HPC1’s operation (right).

Figure 7: Expected number of disk replacements in a
week depending on the number of disk replacements in
the previous week.


5.2 Correlations 


In this section, we focus on the first key property of
a Poisson process, the independence of failures.
Intuitively, it is clear that in practice failures of disks in the
same system are never completely independent. The
failure probability of disks depends on many factors,
including environmental factors such as temperature,
that are shared by all disks in the system. When the
temperature in a machine room is far outside nominal values,
all disks in the room experience a higher than normal
probability of failure. The goal of this section is to
statistically quantify and characterize the correlation between
disk replacements.

We start with a simple test in which we determine the 
correlation of the number of disk replacements observed 
in successive weeks or months by computing the corre- 
lation coefficient between the number of replacements in 
a given week or month and the previous week or month. 
For data coming from a Poisson process we would
expect correlation coefficients to be close to 0. Instead we
find significant levels of correlation, both at the monthly
and the weekly level.

The correlation coefficient between consecutive weeks
is 0.72, and the correlation coefficient between
consecutive months is 0.79. Repeating the same test using only
the data of one year at a time, we still find significant
levels of correlation with correlation coefficients of 0.4-0.8.
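
The test described above amounts to a Pearson correlation between the series of counts and the same series shifted by one interval. A minimal sketch follows; the function names are ours, and the input is assumed to be a list of replacement counts per week or per month in chronological order.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def lag1_correlation(counts):
    """Correlation between each interval's count and the previous interval's."""
    return pearson(counts[:-1], counts[1:])


if __name__ == "__main__":
    # Hypothetical weekly counts with a rising trend, for illustration only.
    weekly = [1, 0, 2, 3, 2, 5, 6, 5, 8, 9]
    print(f"lag-1 correlation: {lag1_correlation(weekly):.2f}")
```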


Statistically, the above correlation coefficients indicate
a strong correlation, but it would be nice to have a more
intuitive interpretation of this result. One way of
thinking of the correlation of failures is that the failure rate in
one time interval is predictive of the failure rate in the
following time interval. To test the strength of this
prediction, we assign each week in HPC1’s life to one of
three buckets, depending on the number of disk
replacements observed during that week, creating a bucket for
weeks with small, medium, and large numbers of
replacements, respectively (see Note 1). The expectation is that
a week that follows a week with a “small” number of disk
replacements is more likely to see a small number of
replacements than a week that follows a week with a “large”
number of replacements. However, if failures are
independent, the number of replacements in a week will not
depend on the number in a prior week.

Figure 7 (left) shows the expected number of disk
replacements in a week of HPC1’s lifetime as a function
of which bucket the preceding week falls in. We
observe that the expected number of disk replacements in
a week varies by a factor of 9, depending on whether the
preceding week falls into the first or third bucket, while
we would expect no variation if failures were
independent. When repeating the same process on the data of
only year 3 of HPC1’s lifetime, we see a difference of
close to a factor of 2 between the first and third bucket.
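
The bucket analysis can be sketched as follows. The cutoffs are taken at the one-third and two-thirds points of the sorted weekly counts, which is our reading of the percentile rule in the Notes; with tied counts the three buckets will not be exactly equal in size. All names are illustrative.

```python
def tercile_cutoffs(counts):
    """Cutoffs at roughly the 33rd and 66th percentile of the counts."""
    s = sorted(counts)
    n = len(s)
    return s[n // 3], s[(2 * n) // 3]


def bucket_of(count, cutoffs):
    """0 = small, 1 = medium, 2 = large number of replacements."""
    low, high = cutoffs
    if count < low:
        return 0
    if count < high:
        return 1
    return 2


def expected_next_by_bucket(weekly_counts):
    """Mean count in a week, grouped by the bucket of the preceding week."""
    cutoffs = tercile_cutoffs(weekly_counts)
    sums, sizes = [0, 0, 0], [0, 0, 0]
    for prev, nxt in zip(weekly_counts[:-1], weekly_counts[1:]):
        b = bucket_of(prev, cutoffs)
        sums[b] += nxt
        sizes[b] += 1
    # None marks a bucket that no preceding week fell into.
    return [s / k if k else None for s, k in zip(sums, sizes)]


if __name__ == "__main__":
    # Hypothetical weekly counts, for illustration only.
    weekly = [0, 0, 1, 5, 6, 5, 20, 22, 21]
    print(expected_next_by_bucket(weekly))
```

For independent failures the three values would be roughly equal; a strong increase from the first to the third bucket is the effect shown in Figure 7.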

So far, we have only considered correlations between
successive time intervals, e.g. between two successive
weeks. A more general way to characterize correlations
is to study correlations at different time lags by using the
autocorrelation function. Figure 6 (left) shows the
autocorrelation function for the number of disk replacements
per week computed across the HPC1 data set. For a
stationary failure process (e.g. data coming from a Poisson
process) the autocorrelation would be close to zero at all
lags. Instead, we observe strong autocorrelation even for
large lags in the range of 100 weeks (nearly 2 years).

Figure 8: Distribution of time between disk replacements across all nodes in HPC1.

We repeated the same autocorrelation test for only
parts of HPC1’s lifetime and found similar levels of
autocorrelation. Figure 6 (right), for example, shows the
autocorrelation function computed only on the data of
the third year of HPC1’s life. Correlation is significant
for lags in the range of up to 30 weeks.
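
For reference, the sample autocorrelation function at lag k can be computed as sketched below. This uses the common estimator that normalizes the lagged covariance by the variance of the whole series; whether this matches the exact estimator behind Figure 6 is an assumption on our part.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return cov / var


def acf(series, max_lag):
    """Autocorrelation values for lags 0..max_lag."""
    return [autocorrelation(series, k) for k in range(max_lag + 1)]


if __name__ == "__main__":
    # Hypothetical weekly replacement counts, for illustration only.
    weekly = [1, 2, 3, 4, 5, 6, 7, 8]
    print([round(r, 3) for r in acf(weekly, 3)])
```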

Another measure of dependency is long-range
dependence, as quantified by the Hurst exponent H. The
Hurst exponent measures how fast the autocorrelation
function drops with increasing lags. A Hurst exponent
between 0.5 and 1 signifies a statistical process with a long
memory and a slow drop of the autocorrelation function.
Applying several different estimators (see Section 2) to
the HPC1 data, we determine a Hurst exponent between
0.6 and 0.8 at the weekly granularity. These values are
comparable to Hurst exponents reported for Ethernet
traffic, which is known to exhibit strong long-range
dependence [16].
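
As an illustration of how a Hurst exponent can be estimated, the sketch below uses the aggregated-variance method, one standard estimator that may or may not be among those referred to above. It relies on the fact that for a long-range dependent series the variance of block means of size m decays like m^(2H - 2). The block sizes and function names are assumptions made for this example.

```python
import math
import random


def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def hurst_aggregated_variance(series, block_sizes=(1, 2, 4, 8, 16, 32, 64)):
    """Estimate H from the slope of log Var(block means) vs. log block size."""
    log_m, log_v = [], []
    for m in block_sizes:
        blocks = len(series) // m
        if blocks < 2:
            continue  # too few blocks to estimate a variance
        means = [sum(series[i * m:(i + 1) * m]) / m for i in range(blocks)]
        v = variance(means)
        if v > 0:
            log_m.append(math.log(m))
            log_v.append(math.log(v))
    # Ordinary least-squares slope of log variance against log block size.
    mx = sum(log_m) / len(log_m)
    my = sum(log_v) / len(log_v)
    slope = sum((a - mx) * (b - my) for a, b in zip(log_m, log_v)) / sum(
        (a - mx) ** 2 for a in log_m
    )
    return 1.0 + slope / 2.0  # slope = 2H - 2


if __name__ == "__main__":
    random.seed(0)
    # Independent samples have no memory, so H should come out near 0.5.
    iid = [random.gauss(0.0, 1.0) for _ in range(4096)]
    print(f"H for independent samples: {hurst_aggregated_variance(iid):.2f}")
```

Values well above 0.5 on the weekly replacement counts would indicate the kind of long memory reported in this section.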


Observation 7: Disk replacement counts exhibit signifi- 
cant levels of autocorrelation. 


Observation 8: Disk replacement counts exhibit long- 
range dependence. 


5.3 Distribution of time between failure 


In this section, we focus on the second key property of 
a Poisson failure process, the exponentially distributed 
time between failures. Figure 8 shows the empirical cu- 
mulative distribution function of time between disk re- 
placements as observed in the HPC1 system and four 
distributions matched to it. 

We find that visually the gamma and Weibull
distributions are the best fit to the data, while exponential and
lognormal distributions provide a poorer fit. This agrees
with results we obtain from the negative log-likelihood,
which indicate that the Weibull distribution is the best fit,
closely followed by the gamma distribution. Performing
a chi-square test, we can reject the hypothesis that
the underlying distribution is exponential or lognormal
at a significance level of 0.05. On the other hand, the
hypothesis that the underlying distribution is a Weibull or a
gamma cannot be rejected at a significance level of 0.05.
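
To make the negative log-likelihood comparison concrete, the sketch below fits an exponential and a Weibull distribution by maximum likelihood and compares their negative log-likelihoods. It is a simplified illustration, not the fitting code used for Figure 8: it assumes strictly positive times between replacements, solves the Weibull shape equation by bisection, and uses synthetic data.

```python
import math
import random


def exponential_nll(xs):
    """Negative log-likelihood at the exponential MLE (rate = 1 / mean)."""
    n = len(xs)
    rate = n / sum(xs)
    return -(n * math.log(rate) - rate * sum(xs))


def weibull_mle(xs, lo=0.05, hi=20.0, iters=60):
    """Weibull MLE (shape, scale) via bisection on the shape's score equation."""
    n = len(xs)
    mean_log = sum(math.log(x) for x in xs) / n

    def score(k):
        # Increasing in k; its root is the maximum-likelihood shape.
        xk = [x ** k for x in xs]
        weighted = sum(p * math.log(x) for p, x in zip(xk, xs)) / sum(xk)
        return weighted - 1.0 / k - mean_log

    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    shape = (lo + hi) / 2.0
    scale = (sum(x ** shape for x in xs) / n) ** (1.0 / shape)
    return shape, scale


def weibull_nll(xs, shape, scale):
    """Negative log-likelihood of the data under Weibull(shape, scale)."""
    return -sum(
        math.log(shape / scale)
        + (shape - 1.0) * math.log(x / scale)
        - (x / scale) ** shape
        for x in xs
    )


if __name__ == "__main__":
    random.seed(1)
    # Synthetic times between replacements (hours), for illustration only.
    gaps = [random.weibullvariate(100.0, 0.7) for _ in range(5000)]
    shape, scale = weibull_mle(gaps)
    print(f"Weibull shape {shape:.2f}, scale {scale:.1f}")
    print(f"NLL exponential {exponential_nll(gaps):.1f}")
    print(f"NLL Weibull     {weibull_nll(gaps, shape, scale):.1f}")
```

A lower negative log-likelihood indicates a better fit. Because the exponential distribution is the special case of a Weibull with shape 1, the fitted Weibull can never do worse; the size of the gap is what is informative.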


Figure 8 (right) shows a close up of the empirical 
CDF and the distributions matched to it, for small time- 
between-replacement values (less than 24 hours). The 
reason that this area is particularly interesting is that a 
key application of the exponential assumption is in esti- 
mating the time until data loss in a RAID system. This 
time depends on the probability of a second disk fail- 
ure during reconstruction, a process which typically lasts 
on the order of a few hours. The graph shows that the 
exponential distribution greatly underestimates the prob- 
ability of a second failure during this time period. For 
example, the probability of seeing two drives in the clus- 
ter fail within one hour is four times larger under the real 
data, compared to the exponential distribution. The prob- 
ability of seeing two drives in the cluster fail within the 
same 10 hours is two times larger under the real data, 
compared to the exponential distribution. 
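
The effect of the distributional assumption on short windows can be illustrated with a small calculation. The sketch compares the probability of another replacement within t hours under an exponential distribution and under a Weibull distribution with the same mean. The mean of 96 hours and the shape of 0.7 are assumptions chosen for illustration; the factors quoted above come from the empirical data, not from this model.

```python
import math


def exponential_cdf(t, mean):
    """P(next replacement within t) under an exponential distribution."""
    return 1.0 - math.exp(-t / mean)


def weibull_cdf(t, shape, mean):
    """P(next replacement within t) under a Weibull with the given mean."""
    # Pick the scale so that the mean matches: mean = scale * Gamma(1 + 1/shape).
    scale = mean / math.gamma(1.0 + 1.0 / shape)
    return 1.0 - math.exp(-((t / scale) ** shape))


def underestimation_factor(t, shape, mean):
    """How much larger the Weibull probability is than the exponential one."""
    return weibull_cdf(t, shape, mean) / exponential_cdf(t, mean)


if __name__ == "__main__":
    mean_hours = 96.0  # assumed mean time between replacements
    shape = 0.7        # assumed Weibull shape (< 1: decreasing hazard rate)
    for window in (1.0, 10.0):
        factor = underestimation_factor(window, shape, mean_hours)
        print(f"{window:>4.0f} h window: factor {factor:.2f}")
```

With a shape parameter below 1, the probability mass shifts towards very short intervals, which is why the exponential assumption is optimistic exactly in the reconstruction window that matters for RAID.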


The poor fit of the exponential distribution might be 
due to the fact that failure rates change over the lifetime 
of the system, creating variability in the observed times 
between disk replacements that the exponential distribu- 
tion cannot capture. We therefore repeated the above 
analysis considering only segments of HPC1’s lifetime. 
Figure 9 shows as one example the results from ana- 
lyzing the time between disk replacements in year 3 of 
HPC1’s operation. While visually the exponential distri- 
bution now seems a slightly better fit, we can still reject 
the hypothesis of an underlying exponential distribution 
at a significance level of 0.05. The same holds for other 
1-year and even 6-month segments of HPC1’s lifetime. 
This leads us to believe that even during shorter segments
of HPC1’s lifetime the time between replacements is not
realistically modeled by an exponential distribution.

Figure 9: Distribution of time between disk replacements
across all nodes in HPC1 for only year 3 of operation.

While it might not come as a surprise that the simple
exponential distribution does not provide as good a
fit as the more flexible two-parameter distributions, an
interesting question is what properties of the empirical
time between failure make it different from a theoretical
exponential distribution. We identify as a first
differentiating feature that the data exhibits higher variability than
a theoretical exponential distribution. The data has a C²
of 2.4, which is more than two times higher than the C²
of an exponential distribution, which is 1.
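
Reading C² as the squared coefficient of variation (variance divided by the squared mean), it can be computed as sketched below. The input is assumed to be a list of times between replacements; the example values are hypothetical.

```python
def squared_cv(xs):
    """Squared coefficient of variation: variance / mean^2."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n  # population variance
    return var / mean ** 2


if __name__ == "__main__":
    # Hypothetical times between replacements (hours), for illustration only.
    gaps = [1, 1, 1, 5]
    print(squared_cv(gaps))
```

For an exponential distribution the variance equals the squared mean, which is why its C² is 1 regardless of the rate.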

A second differentiating feature is that the time be- 
tween disk replacements in the data exhibits decreasing 
hazard rates. Recall from Section 2.4 that the hazard 
rate function measures how the time since the last fail- 
ure influences the expected time until the next failure. 
An increasing hazard rate function predicts that if the 
time since a failure is long then the next failure is com- 
ing soon. And a decreasing hazard rate function predicts 
the reverse. The table below summarizes the parameters 
for the Weibull and gamma distribution that provided the 
best fit to the data. 


HPC1 Data        | Weibull         | Gamma
                 | Shape   Scale   | Shape   Scale
Compute nodes    | 0.73    0.037   | 0.65    176.4
Filesystem nodes | 0.76    0.013   | 0.64    482.6
All nodes        | 0.71    0.049   | 0.59    160.9


Disk replacements in the filesystem nodes, as well as the
compute nodes, and across all nodes, are fit best with
gamma and Weibull distributions with a shape parameter
less than 1, a clear indicator of decreasing hazard rates.
Figure 10 illustrates the decreasing hazard rates of the
time between replacements by plotting the expected
remaining time until the next disk replacement (Y-axis) as
a function of the time since the last disk replacement
(X-axis). We observe that right after a disk was replaced the
expected time until the next disk replacement becomes
necessary was around 4 days, both for the empirical data
and the exponential distribution. In the case of the
empirical data, after surviving for ten days without a disk
replacement the expected remaining time until the next
replacement had grown from initially 4 to 10 days; and
after surviving for a total of 20 days without disk
replacements the expected time until the next failure had grown
to 15 days. In comparison, under an exponential
distribution the expected remaining time stays constant (also
known as the memoryless property).

Figure 10: Illustration of decreasing hazard rates
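
The expected remaining time described above can be estimated directly from observed times between replacements: among all intervals longer than the elapsed time, average how much longer they lasted. The following is a minimal sketch with hypothetical data and an illustrative function name.

```python
def expected_remaining_time(gaps, elapsed):
    """Mean remaining time until the next replacement, given that `elapsed`
    time units have already passed since the last one.

    gaps: observed times between consecutive replacements.
    Returns None if no observed gap exceeds `elapsed`.
    """
    survivors = [g - elapsed for g in gaps if g > elapsed]
    if not survivors:
        return None
    return sum(survivors) / len(survivors)


if __name__ == "__main__":
    # Hypothetical times between replacements (days), for illustration only.
    gaps = [1, 2, 3, 10, 20]
    for elapsed in (0, 5):
        print(elapsed, expected_remaining_time(gaps, elapsed))
```

For data with decreasing hazard rates this value grows with the elapsed time, whereas for exponentially distributed gaps it would stay roughly constant.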

Note that the above result is not in contradiction
with the increasing replacement rates we observed in
Section 4.2 as a function of drive age, since here we look
at the distribution of the time between disk replacements
in a cluster, not disk lifetime distributions (i.e. how long
a drive lived until it was replaced).


Observation 9: The hypothesis that time between disk 
replacements follows an exponential distribution can be 
rejected with high confidence. 


Observation 10: The time between disk replacements 
has a higher variability than that of an exponential 
distribution. 


Observation 11: The distribution of time between disk
replacements exhibits decreasing hazard rates, that is,
the expected remaining time until the next disk
replacement grows with the time that has passed since the last
disk replacement.


6 Related work 


There is very little work published on analyzing failures 
in real, large-scale storage systems, probably as a result 
of the reluctance of the owners of such systems to release 
failure data. 


Among the few existing studies is the work by
Talagala et al. [29], which provides a study of error logs in a
research prototype storage system used for a web server
and includes a comparison of failure rates of different
hardware components. They identify SCSI disk
enclosures as the least reliable components and SCSI disks as
one of the most reliable components, which differs from
our results.

In a recently initiated effort, Schwarz et al. [28] have
started to gather failure data at the Internet Archive,
which they plan to use to study disk failure rates and
bit rot rates and how they are affected by different
environmental parameters. In their preliminary results, they
report ARR values of 2-6% and note that the Internet
Archive does not seem to see significant infant mortality.
Both observations are in agreement with our findings.

Gray [31] reports the frequency of uncorrectable read 
errors in disks and finds that their numbers are smaller 
than vendor data sheets suggest. Gray also provides ARR 
estimates for SCSI and ATA disks, in the range of 3-6%, 
which is in the range of ARRs that we observe for SCSI 
drives in our data sets. 

Pinheiro et al. analyze disk replacement data from a 
large population of serial and parallel ATA drives [23]. 
They report ARR values ranging from 1.7% to 8.6%, 
which agrees with our results. The focus of their study
is on the correlation between various system
parameters and drive failures. They find that while temperature
and utilization exhibit much less correlation with failures
than expected, the values of several SMART counters
correlate highly with failures. For example, they report that
after a scrub error drives are 39 times more likely to fail
within 60 days than drives without scrub errors and that
44% of all failed drives had increased SMART counts in 
at least one of four specific counters. 

Many have criticized the accuracy of MTTF based 
failure rate predictions and have pointed out the need for 
more realistic models. A particular concern is the fact 
that a single MTTF value cannot capture life cycle pat- 
terns [4, 5, 33]. Our analysis of life cycle patterns shows 
that this concern is justified, since we find failure rates 
to vary quite significantly over even the first two to three 
years of the life cycle. However, the most common life 
cycle concern in published research is underrepresenting 
infant mortality. Our analysis does not support this. In- 
stead we observe significant underrepresentation of the 
early onset of wear-out. 

Early work on RAID systems [8] provided some
statistical analysis of time between disk failures for disks
used in the 1980s, but did not find sufficient evidence to
reject the hypothesis of exponential times between
failure with high confidence. However, time between failure
has been analyzed for other, non-storage data in several
studies [11, 17, 26, 27, 30, 32]. Four of the studies use
distribution fitting and find the Weibull distribution to be
a good fit [11, 17, 27, 32], which agrees with our results.
All studies looked at the hazard rate function, but came to
different conclusions. Four of them [11, 17, 27, 32] find
decreasing hazard rates (Weibull shape parameter < 0.5).
Others find that hazard rates are flat [30], or increasing
[26]. We find decreasing hazard rates with a Weibull shape
parameter of 0.7-0.8.

Large-scale failure studies are scarce, even when con- 
sidering IT systems in general and not just storage sys- 
tems. Most existing studies are limited to only a few 
months of data, covering typically only a few hundred 
failures [13, 20, 21, 26, 30, 32]. Many of the most
commonly cited studies on failure analysis stem from the late
80’s and early 90’s, when computer systems were
significantly different from today [9, 10, 12, 17, 18, 19, 30].


7 Conclusion 


Many have pointed out the need for a better understand- 
ing of what disk failures look like in the field. Yet hardly 
any published work exists that provides a large-scale 
study of disk failures in production systems. As a first 
step towards closing this gap, we have analyzed disk re- 
placement data from a number of large production sys- 
tems, spanning more than 100,000 drives from at least 
four different vendors, including drives with SCSI, FC 
and SATA interfaces. Below is a summary of a few of 
our results. 


• Large-scale installation field usage appears to differ
  widely from nominal datasheet MTTF conditions.
  The field replacement rates of systems were
  significantly larger than we expected based on datasheet
  MTTFs.


• For drives less than five years old, field replacement
  rates were larger than what the datasheet MTTF
  suggested by a factor of 2-10. For five to eight year
  old drives, field replacement rates were a factor of
  30 higher than what the datasheet MTTF suggested.


• Changes in disk replacement rates during the first
  five years of the lifecycle were more dramatic than
  often assumed. While replacement rates are often
  expected to be in steady state in years 2-5 of
  operation (bottom of the “bathtub curve”), we observed
  a continuous increase in replacement rates, starting
  as early as in the second year of operation.


• In our data sets, the replacement rates of SATA
  disks are not worse than the replacement rates of
  SCSI or FC disks. This may indicate that
  disk-independent factors, such as operating conditions,
  usage and environmental factors, affect replacement
  rates more than component specific factors.
  However, the only evidence we have of a bad batch
  of disks was found in a collection of SATA disks
  experiencing high media error rates. We have
  too little data on bad batches to estimate the
  relative frequency of bad batches by type of disk,
  although there is plenty of anecdotal evidence that
  bad batches are not unique to SATA disks.


• The common concern that MTTFs underrepresent
  infant mortality has led to the proposal of new
  standards that incorporate infant mortality [33]. Our
  findings suggest that the underrepresentation of the
  early onset of wear-out is a much more serious
  factor than underrepresentation of infant mortality, and
  we recommend that it be included in new standards.


• While many have suspected that the commonly
  made assumption of exponentially distributed time
  between failures/replacements is not realistic,
  previous studies have not found enough evidence to
  prove this assumption wrong with significant
  statistical confidence [8]. Based on our data analysis,
  we are able to reject the hypothesis of
  exponentially distributed time between disk replacements
  with high confidence. We suggest that researchers
  and designers use field replacement data, when
  possible, or two-parameter distributions, such as the
  Weibull distribution.


• We identify as the key features that distinguish the
  empirical distribution of time between disk
  replacements from the exponential distribution higher
  levels of variability and decreasing hazard rates. We
  find that the empirical distributions are fit well by
  a Weibull distribution with a shape parameter
  between 0.7 and 0.8.


• We also present strong evidence for the existence
  of correlations between disk replacement
  interarrivals. In particular, the empirical data exhibits
  significant levels of autocorrelation and long-range
  dependence.


Notes


1 More precisely, we choose the cutoffs between the buckets such
that each bucket contains the same number of samples (i.e. weeks) by
using the 33rd percentile and the 66th percentile of the empirical
distribution as cutoffs between the buckets.


2 This report was prepared as an account of work sponsored by an 
agency of the United States Government. Neither the United States 
Government nor any agency thereof, nor any of their employees, makes 
any warranty, express or implied, or assumes any legal liability or re- 
sponsibility for the accuracy, completeness, or usefulness of any in- 
formation, apparatus, product, or process disclosed, or represents that 
its use would not infringe privately owned rights. Reference herein to 
any specific commercial product, process, or service by trade name, 
trademark, manufacturer, or otherwise does not necessarily constitute 
or imply its endorsement, recommendation, or favoring by the United 
States Government or any agency thereof. The views and opinions of
authors expressed herein do not necessarily state or reflect those of the
United States Government or any agency thereof.